Glassdoor Comments and Ratings Analysis

Katherine Li

Table of Contents


I. Project Description

This project is based on a dataset from Glass Door. It is built for various purposes:
(1) Exploration about the unigram term frequency, tf-idf, and bigram presence of the text in pro reviews, con reviews, and advice.
(2) What rating type has an outstanding pattern other than the rest and is worthy of further inspection?
(3) How does the sentiment scores shown in pro reviews, con reviews, and advice affect the rating of interest?
(4) Topic modelings for pro reviews, con reviews, and advice.
(5) For the topics modeled, how does rating impact the mapping of such topics?

II. Relevant Packages

The following packages are called in for this project.


library(dplyr)
library(tidyverse)
library(stringr)
library(tidytext)
library(tm)
library(rmarkdown)
library(wordcloud2)
library(lexicon)
library(textdata)
library(gganimate)
library(httr)
library(textstem)
library(widyr)
library(stopwords)
library(tibble)
library(NMF)
library(fmsb)
library(stm)
library(ggpubr)
library(lme4)

III. The Data

A. Import & Overview

Load in the data.


load("/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW2/glassDoor.rdata")

rmarkdown::paged_table(glassDoor)

There is no NA’s in the overall rating. This might be the rating of interest for further inspection given this perk. As the numbers of NA entries vary for other types of ratings that include work life rating, culture value rating, career opportunity rating, company benefit rating, and management rating.


summary(glassDoor)

     pros               cons              advice          rating   
 Length:1831        Length:1831        Length:1831        4.0:461  
 Class :character   Class :character   Class :character   3.0:370  
 Mode  :character   Mode  :character   Mode  :character   5.0:551  
                                                          2.0:170  
                                                          1.0:279  
                                                                   
                                                                   
 workLifeRating cultureValueRating careerOpportunityRating
 4.0    :431    5.0    :526        5.0    :393            
 5.0    :407    4.0    :340        4.0    :388            
 3.0    :363    1.0    :310        3.0    :342            
 1.0    :231    3.0    :302        1.0    :284            
 2.0    :195    2.0    :151        2.0    :202            
 (Other): 59    (Other): 48        (Other): 64            
 NA's   :145    NA's   :154        NA's   :158            
 compBenefitsRating managementRating organization      
 4.0    :434        2.0 :160         Length:1831       
 3.0    :406        3.0 :275         Class :character  
 5.0    :327        4.0 :313         Mode  :character  
 2.0    :237        1.0 :342                           
 1.0    :229        5.0 :398                           
 (Other): 33        NA's:343                           
 NA's   :165                                           

B. Cleaning & Preprocessing

Missing Values

Check if there is any missing entries in columns pros, cons, and advice. There are 1831 rows of entries, while there is none missing in column pros and cons; there are 655 NA’s in column advice, which is a 35.77% of the total rows of data - this is maybe something that we need to come back to later.


sum(is.na(glassDoor$pros))

[1] 0

sum(is.na(glassDoor$cons))

[1] 0

sum(is.na(glassDoor$advice))

[1] 655

Data Type

An extra step taken to make sure the data types are in shape for further inspection.


glassDoor <- glassDoor %>%
  mutate(organization = as.factor(organization)) %>%
  mutate_at(c("rating", "workLifeRating", "cultureValueRating", "careerOpportunityRating", "compBenefitsRating", "managementRating"), as.numeric)

Issues within the Text in Pros, Cons, and Advice

There are several issues with the text in pros, cons and advice: i.e. some words run together, non-english text and maybe more. These issues need to be addressed since collectively, they might affect the result of text analytics to a certain extent. Plus, stop words are removed for preparation purposes.


#library(stringi)

glassDoor$pros <- str_replace_all(glassDoor$pros, "[^[:alnum:]]", " ") %>% 
  str_squish(.) %>% 
  gsub("([a-z])([A-Z])", "\\1 \\2", .) %>% 
  tm::removeWords(., words = stopwords("en")) %>% 
  lemmatize_strings(.) %>% 
  tm::removeNumbers(.)

glassDoor$cons <- str_replace_all(glassDoor$cons, "[^[:alnum:]]", " ") %>% 
  str_squish(.) %>% 
  gsub("([a-z])([A-Z])", "\\1 \\2", .) %>% 
  tm::removeWords(., words = stopwords("en")) %>% 
  lemmatize_strings(.) %>% 
  tm::removeNumbers(.)

glassDoor$advice <- str_replace_all(glassDoor$advice, "[^[:alnum:]]", " ") %>% 
  str_squish(.) %>% 
  gsub("([a-z])([A-Z])", "\\1 \\2", .) %>% 
  tm::removeWords(., words = stopwords("en")) %>% 
  lemmatize_strings(.) %>% 
  tm::removeNumbers(.)

#glassDoor$pros <- stringi::stri_trans_general(glassDoor$pros, "latin-ascii")
#glassDoor$cons <- stringi::stri_trans_general(glassDoor$cons, "latin-ascii")
#glassDoor$advice <- stringi::stri_trans_general(glassDoor$advice, "latin-ascii")

#glassDoor$pros <- iconv(glassDoor$pros, to = "ASCII")

# remove the non-english text in pros, cons, and advice

germanText = grepl("den|der|die|ich", glassDoor$pros)
germanText = grepl("den|der|die|ich", glassDoor$cons)
germanText = grepl("den|der|die|ich", glassDoor$advice)

germanRows = which(germanText)

glassDoor = glassDoor[-c(germanRows), ]

Create a new column as ID No.

Just for the convenience in future endeavor.


glassDoor$id <- seq.int(nrow(glassDoor))

IV. Word CloudS

To start with, who doesn’t love clouds?

Pros


glassDoor %>%
  dplyr::select(pros) %>%
  unnest_tokens(word, pros) %>%
  count(word, sort = TRUE) %>%
  anti_join(stop_words, by = "word") %>%
  filter(n>50) %>%
  na.omit() %>%
  wordcloud2::wordcloud2(color = "darkseagreen", shape = "circle")

Cons


glassDoor %>%
  dplyr::select(cons) %>%
  unnest_tokens(word, cons) %>%
  count(word, sort = TRUE) %>%
  anti_join(stop_words, by = "word") %>%
  filter(n>50) %>%
  na.omit() %>%
  wordcloud2::wordcloud2(color = "darksalmon", shape = "circle")

Advice


glassDoor %>%
  dplyr::select(advice) %>%
 # filter(!is.na(advice)) %>%
  unnest_tokens(word, advice) %>%
  count(word, sort = TRUE) %>%
  anti_join(stop_words, by = "word") %>%
  filter(n>15) %>%
  na.omit() %>%
  wordcloud2::wordcloud2(color = "cadetblue3", shape = "circle")

V. Term Frequency

Pros

The table shows the top 15 words with highest term frequency in the pro reviews for each organization. Words like “company”, “people”, “managment” remain high on the list almost for each organization.


pro <- glassDoor %>%
  select(pros, organization) %>%
  mutate(pros = tolower(pros),
         pros = lemmatize_strings(pros))

pro2 <- pro %>%
  unnest_tokens(word, pros) %>%
  count(organization, word, sort = TRUE) %>%
  anti_join(stop_words, by = "word")

proTF <- pro %>%
  split(., .$organization) %>%
  lapply(., function(x) {
    prosTokens = tm::MC_tokenizer(x$pros)
    total = length(prosTokens)
  })

temppro <- unlist(proTF)

proTF <- data.frame(organization = names(temppro),
                             total = temppro)

rownames(proTF) <- NULL

proTF <- pro2 %>%
  left_join(., proTF, by = "organization")

proTF$tfreq <- proTF$n/proTF$total

proTFtop <- proTF %>%
  group_by(organization) %>%
  arrange(desc(tfreq)) %>%
  slice(1:15)

rmarkdown::paged_table(proTFtop)

Cons

The table shows the top 15 words with highest term frequency in the con reviews for each organization. Words like “company”, “employee”, “managment” remain high on the list almost for each organization - pretty similar to those in the pro reviews.


con <- glassDoor %>%
  select(cons, organization) %>%
  mutate(cons = tolower(cons),
         cons = lemmatize_strings(cons))

con2 <- con %>%
  unnest_tokens(word, cons) %>%
  count(organization, word, sort = TRUE) %>%
  anti_join(stop_words, by = "word")

conTF <- con %>%
  split(., .$organization) %>%
  lapply(., function(x) {
    consTokens = tm::MC_tokenizer(x$cons)
    total = length(consTokens)
  })

tempcon <- unlist(conTF)

conTF <- data.frame(organization = names(tempcon),
                             total = tempcon)

rownames(conTF) <- NULL

conTF <- con2 %>%
  left_join(., conTF, by = "organization")

conTF$tfreq <- conTF$n/conTF$total

conTFtop <- conTF%>%
  group_by(organization) %>%
  arrange(desc(tfreq)) %>%
  slice(1:15)

rmarkdown::paged_table(conTFtop)

Advice

There are plenty NA’s in the column advice. Establishing the term frequency table without putting these NA’s aside would result in NA being of the highest term frequency for all four organizations. Thus, get rid of those NA’s here.

The table shows the top 15 words with highest term frequency in the advice reviews for each organization. The words are practically the same as pro and con reviews for each company; to iterate, they are “company”, “employee”, “management”, and “people”.


advicedf <- glassDoor %>%
  select(advice, organization) %>%
  filter(!is.na(advice)) %>%
  mutate(advice = tolower(advice),
         advice = lemmatize_strings(advice))

advicedf2 <- advicedf %>%
  unnest_tokens(word, advice) %>%
  count(organization, word, sort = TRUE) %>%
  anti_join(stop_words, by = "word")

adviceTF <- advicedf %>%
  split(., .$organization) %>%
  lapply(., function(x) {
    adviceTokens = tm::MC_tokenizer(x$advice)
    total = length(adviceTokens)
  })

tempadvice <- unlist(adviceTF)

adviceTF <- data.frame(organization = names(tempadvice),
                             total = tempadvice)

rownames(adviceTF) <- NULL

adviceTF <- advicedf2 %>%
  left_join(., adviceTF, by = "organization")

adviceTF$tfreq <- adviceTF$n/adviceTF$total

adviceTFtop <- adviceTF %>%
  group_by(organization) %>%
  arrange(desc(tfreq)) %>%
  slice(1:15)

rmarkdown::paged_table(adviceTFtop)

VI. tf-idf

In addition to term frequency, tf-idf is another valued concept to look at because it implies the words with greater importance as to the document.

Pros

This table shows the top 10 most “important” words in the pro reviews for each organization.


#proIDF <- proTF %>%
#  group_by(word) %>%
#  count() %>%
#  mutate(idf = log(length(unique(proTF$organization))/n)) %>%
#  arrange(desc(idf))

# rmarkdown::paged_table(proIDF)

proTFIDF <- proTF %>%
  tidytext::bind_tf_idf(word, organization, n) %>%
  arrange(desc(tf_idf)) %>%
  select(-tfreq)

# overall
# proTFIDF

proTFIDFtop <- proTFIDF %>%
  group_by(organization) %>%
  arrange(desc(tf_idf)) %>%
  slice(1:10)

rmarkdown::paged_table(proTFIDFtop)

Cons

This table shows the top 10 most “important” words in the con reviews for each organization. Interesting to see words like “bill” and “budget” pop up.


conTFIDF <- conTF %>%
  tidytext::bind_tf_idf(word, organization, n) %>%
  arrange(desc(tf_idf)) %>%
  select(-tfreq)

conTFIDFtop <- conTFIDF %>%
  group_by(organization) %>%
  arrange(desc(tf_idf)) %>%
  slice(1:10)

rmarkdown::paged_table(conTFIDFtop)

Advice

This table shows the top 10 most “important” words in the advice for each organization.


adviceTFIDF <- adviceTF %>%
  tidytext::bind_tf_idf(word, organization, n) %>%
  arrange(desc(tf_idf)) %>%
  select(-tfreq)

adviceTFIDFtop <- adviceTFIDF %>%
  group_by(organization) %>%
  arrange(desc(tf_idf)) %>%
  slice(1:10)

rmarkdown::paged_table(adviceTFIDFtop)

VII. N-grams (Bigrams)

Pros

This table shows the bigrams with the highest presence in pro reviews, such as “life balance”, “learn lot”, “opportunity learn”, and “lot opportunity”.


proBigram <- glassDoor %>%
  select(organization, pros) %>%
  mutate(pros = tolower(pros)) %>%
  mutate(pros = lemmatize_strings(pros)) %>%
  unnest_tokens(bigram, pros, token = "ngrams", n = 2)

proBigram2 <- proBigram %>%
  separate(bigram, c("word1", "word2"), sep = " ")

proBigram3 <- proBigram2 %>%
  filter(!(word1 %in% stop_words$word)) %>% 
  filter(!(word2 %in% stop_words$word))

proBigramFinal <- proBigram3 %>%
  unite(bigram, word1, word2, sep = " ") %>%
  count(bigram, sort = TRUE)

rmarkdown::paged_table(proBigramFinal)

Cons

This table shows the bigrams with the highest presence in con reviews. “Life balance”, “senior management” show up again. And “low pay” and “low salary” are quite expected on this list.


conBigram <- glassDoor %>%
  select(organization, cons) %>%
  mutate(cons = tolower(cons)) %>%
  mutate(cons = lemmatize_strings(cons)) %>%
  unnest_tokens(bigram, cons, token = "ngrams", n = 2)

conBigram2 <- conBigram %>%
  separate(bigram, c("word1", "word2"), sep = " ")

conBigram3 <- conBigram2 %>%
  filter(!(word1 %in% stop_words$word)) %>% 
  filter(!(word2 %in% stop_words$word))

conBigramFinal <- conBigram3 %>%
  unite(bigram, word1, word2, sep = " ") %>%
  count(bigram, sort = TRUE)

rmarkdown::paged_table(conBigramFinal)

Advice

This table shows the bigrams with the highest presence in advice; interesting to see bigrams like “treat employee”, “care employee”, “listen employee”, “pay attention”, and “hr policy”.


adviceBigram <- glassDoor %>%
  select(organization, advice) %>%
  filter(!is.na(advice)) %>% # get rid of the NA's
  mutate(advice = tolower(advice)) %>%
  mutate(advice = lemmatize_strings(advice)) %>%
  unnest_tokens(bigram, advice, token = "ngrams", n = 2)

adviceBigram2 <- adviceBigram %>%
  separate(bigram, c("word1", "word2"), sep = " ")

adviceBigram3 <- adviceBigram2 %>%
  filter(!(word1 %in% stop_words$word)) %>% 
  filter(!(word2 %in% stop_words$word))

adviceBigramFinal <- adviceBigram3 %>%
  unite(bigram, word1, word2, sep = " ") %>%
  count(bigram, sort = TRUE)

rmarkdown::paged_table(adviceBigramFinal)

VIII. A Peak into All the Ratings

To get a basic idea the average of all ratings (overall rating, work life rating, culture value rating, career opporunity rating, company benefit rating, and management rating) by organization/company, the following plot is generated. This step is conducted to see if any type of rating stands out, and is worthy of further inspection of the relationship between THE rating and sentiment/topic modeling.

From the result, we could see that the averages of all ratings are look alike to each other for each organization. Thus, the OVERALL RATING is picked as the factor to be of further inspection as this project proceeds.


avg_rating <- glassDoor %>%
  group_by(organization) %>%
  summarize(rating_avg = round(mean(rating, na.rm = TRUE), 2),
            workLifeRating_avg = round(mean(workLifeRating, na.rm = TRUE), 2),
            cultureValueRating_avg = round(mean(cultureValueRating, na.rm = TRUE), 2),
            careerOpportunityRating_avg = round(mean(careerOpportunityRating, na.rm = TRUE), 2),
            compBenefitsRating_avg = round(mean(compBenefitsRating, na.rm = TRUE), 2),
            managementRating_avg = round(mean(managementRating, na.rm = TRUE), 2))

set.seed(1234)

avg_rating_temp <- avg_rating %>% remove_rownames %>% column_to_rownames(var = "organization") %>%
  as.data.frame() 

avg_rating_temp <- rbind(rep(5,5), rep(0,5), avg_rating_temp)

radarchart(avg_rating_temp, axistype = 1, cglcols = "grey", cglty=1, axislabcol="grey", caxislabels=seq(0,20,5), cglwd=0.8, vlcex = 0.7)

IX. Sentiment Analysis

“nrc” lexicon is utilized here to conduct the sentiment analysis.

Preparation work


#get_sentiments("nrc")

nrcWord <- textdata::lexicon_nrc()
nrcValues <- lexicon::hash_sentiment_nrc

# average sentiment per id in cons
conSenti <- glassDoor %>%
  select(cons, rating, organization, id) %>%
  unnest_tokens(tbl = ., output = word, input = cons) %>%
  inner_join(nrcValues, by = c('word' = 'x')) %>%
  group_by(id) %>%
  summarise(con_sentiment = round(mean(y),2))

#Average sentiment per id in pros

proSenti <- glassDoor %>%
  select(pros, rating, organization, id) %>%
  unnest_tokens(tbl = ., output = word, input = pros) %>%
  inner_join(nrcValues, by = c('word' = 'x')) %>%
  group_by(id) %>%
  summarise(pro_sentiment = round(mean(y),2))

# Average sentiment per id in advice

adviceSenti <- glassDoor %>%
  select(advice, rating, organization, id) %>%
  unnest_tokens(tbl = ., output = word, input = advice) %>%
  inner_join(nrcValues, by = c('word' = 'x')) %>%
  group_by(id) %>%
  summarise(advice_sentiment = round(mean(y),2))

Merge the three tables together.


glassDoor2 <- glassDoor %>%
  left_join(conSenti, by = "id") %>%
  left_join(proSenti, by = "id") %>%
  left_join(adviceSenti, by = "id")

Get the table of the average sentiment per organization by pros, cons, and advice

From the table, we could see that just as expected the average sentiment scores assigned to the con reviews are much lower than those assigned to pro reviews and advice for every organization. The sentiment scores reflected from advice are slightly lower than the scores for pro reviews for every organization - which makes sense since the tone used to write advice tends to be more neutral from common sense.

Specifically, ORGC has the lowest sentiment scores (towards negativity) in both con and pro reivews. The sentiment for advice is most negative/lowest for ORGB, but it is one of the two organizations with the highest sentiment scores in pro reviews.


avgSenti <- glassDoor2 %>%
  group_by(organization) %>%
  summarize (con_sentiment_avg = round(mean(con_sentiment, na.rm = TRUE), 2),
             pro_sentiment_avg = round(mean(pro_sentiment, na.rm = TRUE), 2),
             advice_sentiment_avg = round(mean(advice_sentiment, na.rm = TRUE), 2))

rmarkdown::paged_table(avgSenti)

# merge the table showing the average ratings by organization and this table showing the average sentiment scores of pros, cons, and advice by organization
glassDoor_org <- avgSenti %>%
  left_join(avg_rating, by = "organization") %>%
  mutate(organization = as.factor(organization))

X. A Peak into All the Ratings

To get a basic idea the average of all ratings (overall rating, work life rating, culture value rating, career opporunity rating, company benefit rating, and management rating) by organization/company, the following plot is generated. This step is conducted to see if any type of rating stands out, and is worthy of further inspection of the relationship between THE rating and sentiment/topic modeling.

From the result, we could see that the averages of all ratings are look alike to each other for each organization. Thus, the OVERALL RATING is picked as the factor to be of further inspection as this project proceeds.


avg_rating <- glassDoor %>%
  group_by(organization) %>%
  summarize(rating_avg = round(mean(rating, na.rm = TRUE), 2),
            workLifeRating_avg = round(mean(workLifeRating, na.rm = TRUE), 2),
            cultureValueRating_avg = round(mean(cultureValueRating, na.rm = TRUE), 2),
            careerOpportunityRating_avg = round(mean(careerOpportunityRating, na.rm = TRUE), 2),
            compBenefitsRating_avg = round(mean(compBenefitsRating, na.rm = TRUE), 2),
            managementRating_avg = round(mean(managementRating, na.rm = TRUE), 2))

set.seed(1234)

avg_rating_temp <- avg_rating %>% remove_rownames %>% column_to_rownames(var = "organization") %>%
  as.data.frame() 

avg_rating_temp <- rbind(rep(5,5), rep(0,5), avg_rating_temp)

radarchart(avg_rating_temp, axistype = 1, cglcols = "grey", cglty=1, axislabcol="grey", caxislabels=seq(0,20,5), cglwd=0.8, vlcex = 0.7)

XI. How does sentiment impact the rating?

Model 1

From the output of this mode, we could see that pro sentiment and advice sentiment have significant impact on the rating; while con sentiment also impacts rating significantly but probably not on par with the effect from pro and advice.


glassDoor_mod <- glassDoor %>%
  left_join(conSenti, by = "id") %>%
  left_join(proSenti, by = "id") %>%
  left_join(adviceSenti, by = "id") 

mod1 <- lm(rating ~ pro_sentiment + con_sentiment+ advice_sentiment + organization, data = glassDoor_mod)

summary(mod1)

Call:
lm(formula = rating ~ pro_sentiment + con_sentiment + advice_sentiment + 
    organization, data = glassDoor_mod)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6024 -1.2921 -0.0133  1.2032  2.5402 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       4.02889    0.15999  25.182  < 2e-16 ***
pro_sentiment    -0.77921    0.13914  -5.600 3.17e-08 ***
con_sentiment    -0.28687    0.09512  -3.016  0.00266 ** 
advice_sentiment -0.43610    0.09986  -4.367 1.47e-05 ***
organizationORGB -0.03360    0.14973  -0.224  0.82254    
organizationORGC -0.08988    0.15061  -0.597  0.55086    
organizationORGD -0.17007    0.15124  -1.124  0.26122    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.379 on 645 degrees of freedom
  (1025 observations deleted due to missingness)
Multiple R-squared:  0.1053,    Adjusted R-squared:  0.09695 
F-statistic: 12.65 on 6 and 645 DF,  p-value: 1.611e-13

Model 2

Normal Distribution Test

To test the if the distribution of the dependent variable is normal.

Well… it might seem a bit far-fetched to do the mixed model on this issue as the normality of the distribution of rating appears as follows.


ggdensity(glassDoor_mod$rating, fill = "lightgrey")


ggqqplot(glassDoor_mod$rating)

The Actual Model

A mixed model approach.

From the estimate of the variance explained by the random effect - it is indistinguishable from zero, so random effect probably doesn’t matter and a regular linear model will serve the purpose here instead. From the estimate of the fixed effect, we can see that the rating does differ given different pro, con, and advice sentiment. In other words, sentiment from pro, con, and advice all impact the rating.


mod2 <- lmer(rating ~ pro_sentiment + con_sentiment + advice_sentiment + (1|organization), data = glassDoor_mod, REML = FALSE)

summary(mod2)

Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: 
rating ~ pro_sentiment + con_sentiment + advice_sentiment + (1 |  
    organization)
   Data: glassDoor_mod

     AIC      BIC   logLik deviance df.resid 
  2276.1   2302.9  -1132.0   2264.1      646 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.59850 -0.92862 -0.02767  0.85930  1.85368 

Random effects:
 Groups       Name        Variance Std.Dev.
 organization (Intercept) 0.000    0.000   
 Residual                 1.886    1.373   
Number of obs: 652, groups:  organization, 4

Fixed effects:
                 Estimate Std. Error t value
(Intercept)       3.96169    0.13457  29.439
pro_sentiment    -0.78227    0.13849  -5.649
con_sentiment    -0.29193    0.09432  -3.095
advice_sentiment -0.43335    0.09940  -4.360

Correlation of Fixed Effects:
            (Intr) pr_snt cn_snt
pro_sentmnt -0.718              
con_sentmnt -0.171 -0.111       
advc_sntmnt -0.395 -0.132 -0.049
convergence code: 0
boundary (singular) fit: see ?isSingular

XII. Topic Models

Preparation work


tmA <- glassDoor %>%
  filter(organization == "ORGA") %>%
  summarise(organization = "ORGA",
            pros = paste(pros, collapse = " "),
            cons = paste(cons, collapse = " "),
            advice = paste(advice, collapse = " "),
            rating = mean(rating, na.rm = TRUE))

tmB <- glassDoor %>%
  filter(organization == "ORGB") %>%
  summarise(organization = "ORGB",
            pros = paste(pros, collapse = " "),
            cons = paste(cons, collapse = " "),
            advice = paste(advice, collapse = " "),
            rating = mean(rating, na.rm = TRUE))

tmC <- glassDoor %>%
  filter(organization == "ORGC") %>%
  summarise(organization = "ORGC",
            pros = paste(pros, collapse = " "),
            cons = paste(cons, collapse = " "),
            advice = paste(advice, collapse = " "),
            rating = mean(rating, na.rm = TRUE))

tmD <- glassDoor %>%
  filter(organization == "ORGD") %>%
  summarise(organization = "ORGD",
            pros = paste(pros, collapse = " "),
            cons = paste(cons, collapse = " "),
            advice = paste(advice, collapse = " "),
            rating = mean(rating, na.rm = TRUE))

glassDoor_tm <- rbind(tmA, tmB, tmC, tmD)

Pro Reviews

Preparation work


pro_tm <- glassDoor_tm %>% as.data.frame()
colnames(pro_tm) <- c("doc_id", "text", "rating")
proCorpus = Corpus(DataframeSource(pro_tm))

proTDM <- TermDocumentMatrix(proCorpus, control = list(weighting = 
                                                         function(x)
                                                           weightTfIdf(x, normalize = FALSE)))

inspect(proTDM)

<<TermDocumentMatrix (terms: 3502, documents: 4)>>
Non-/sparse entries: 4221/9787
Sparsity           : 70%
Maximal term length: 25
Weighting          : term frequency - inverse document frequency (tf-idf)
Sample             :
           Docs
Terms             ORGA       ORGB ORGC       ORGD
  delivery    0.000000  10.000000    0   0.000000
  ensure      0.000000   0.000000   10   0.000000
  find        6.225562   2.905262    0   4.150375
  fmc         8.000000   0.000000    0   2.000000
  orga      132.000000   0.000000    0   0.000000
  orgb        0.000000 112.000000    0   0.000000
  orgc        0.000000   0.000000  144   0.000000
  orgcpany    0.000000   0.000000   10   0.000000
  orgd        0.000000   0.000000    0 180.000000
  something   4.000000   8.000000    0   0.000000

proConvert <- as.data.frame(as.matrix(proTDM))

proTibble = as_tibble(proConvert, .name_repair = "universal")

proTibble <- proTibble %>%
  mutate_all(., funs(. + .1))

rownames(proTibble) <- rownames(proConvert)

proNMF <- nmf(proTibble, 4, seed = 1001)

save(proNMF, file = "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW2/nmfOut.Rata")

load("/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW2/nmfOut.Rata")

wMatrix = as.data.frame(basis(proNMF))

head(wMatrix[order(-wMatrix$V1), ], 5)

                    V1           V2           V3           V4
orgb         15438.749 2.220446e-16 2.220446e-16 3.908536e-06
delivery      1390.977 1.413403e+01 1.098141e+01 1.611776e+01
enviornment   1115.531 1.441116e+01 1.119673e+01 1.643379e+01
satisfaction  1115.531 1.441116e+01 1.119673e+01 1.643379e+01
something     1115.114 1.394087e+01 4.943255e+02 1.589749e+01

head(wMatrix[order(-wMatrix$V2), ], 5)

                      V1         V2           V3           V4
orgc        3.668671e-05 22413.2119 2.220446e-16 2.220446e-16
ensure      1.279024e+01  1570.9150 1.122066e+01 1.646891e+01
orgcpany    1.279024e+01  1570.9150 1.122066e+01 1.646891e+01
orgcployees 1.298114e+01  1259.8359 1.138813e+01 1.671471e+01
attention   1.317204e+01   948.7569 1.155560e+01 1.696052e+01

head(wMatrix[order(-wMatrix$V3), ], 5)

                 V1           V2         V3           V4
orga   2.220446e-16 1.743540e-09 15955.3089 2.220446e-16
fmc    1.275900e+01 1.440669e+01   978.2945 3.714140e+02
find   4.129018e+02 1.402733e+01   763.6384 7.526571e+02
air    1.311997e+01 1.481427e+01   736.7512 1.689348e+01
ardent 1.311997e+01 1.481427e+01   736.7512 1.689348e+01

head(wMatrix[order(-wMatrix$V4), ], 5)

                   V1           V2        V3        V4
orgd     2.220453e-16 2.220816e-16  10.15161 31948.676
sharona  1.313385e+01 1.482995e+01  11.97329  1436.852
next     1.302973e+01 1.471238e+01 132.75550  1436.718
ceos     1.328657e+01 1.500239e+01  11.99447  1082.064
critical 1.328657e+01 1.500239e+01  11.99447  1082.064

set.seed(1001)

holdoutRows <- sample(1:nrow(glassDoor_mod), 100, replace = FALSE)

pro_text <- textProcessor(documents = glassDoor_mod$pros[-c(holdoutRows)],
                          metadata = glassDoor_mod[-c(holdoutRows), ],
                          stem = FALSE)

Building corpus... 
Converting to Lower Case... 
Removing punctuation... 
Removing stopwords... 
Removing numbers... 
Creating Output... 

pro_prep = prepDocuments(documents=pro_text$documents,
                         vocab = pro_text$vocab,
                         meta = pro_text$meta)

Removing 1787 of 3321 terms (1787 of 20611 tokens) due to frequency 
Removing 1 Documents with No Words 
Your corpus now has 1576 documents, 1534 terms and 18824 tokens.

To determine the best number of topics, focus on the semantic coherence and the residuals. Semantic coherence means how well the words hang together – computed from taking a conditional probability score from frequent words; it is essentially a measure of human interpretability. Thus, low residual and high semantic coherence is preferred. In this case, as k passes 5, the residuals seemingly take a sharp dive as k increases. Thus, choose 5 as the k.


ktest = searchK(documents = pro_prep$documents,
                vocab = pro_prep$vocab, 
                K = c(3, 4, 5, 10, 20), verbose = FALSE)

plot(ktest)

Wih the 5 topics in hand, compute the actual model. The expected topic proportion plot is as follows. Here, topic 4 has the highest expected topic proportion, which kind of makes sense in real life context: as this is the topic modeling on the pro reviews, it is reasonable for words like great, work, company to be included in the highest mentioned topic.


topics5 <- stm(documents = pro_prep$documents,
               vocab = pro_prep$vocab, seed = 1001,
               K = 5, verbose = FALSE)

plot(topics5)

Now, see what emerges from the topics in details. FREX words are probably what we should focus on here, since they occur frequently within the topic and are exclusive to that topic. Here, in topic 4: FEEX words are great, company, learn, environment, get, life, salary - which are all rational for pro reviews. Also, the highest probability words are worthy of attention because there are the words with the highest probability of occuring within that topic. Here, under topic 4, highest probability words are: work, great, company, learn, environment, get, project - which are in high accordance with the FEEX words.


labelTopics(topics5)

Topic 1 Top Words:
     Highest Prob: good, people, management, benefit, nice, job, experience 
     FREX: good, people, benefit, nice, job, exposure, fun 
     Lift: amcf, busy, cant, guide, immediate, induction, london 
     Score: good, people, benefit, nice, management, job, exposure 
Topic 2 Top Words:
     Highest Prob: team, can, firm, office, provide, year, consultant 
     FREX: office, provide, consultant, orga, also, need, family 
     Lift: apart, background, content, dress, ensure, external, fairly 
     Score: office, firm, team, consultant, need, professional, provide 
Topic 3 Top Words:
     Highest Prob: employee, lot, pay, culture, friendly, place, balance 
     FREX: employee, pay, culture, balance, consult, orgd, small 
     Lift: advisor, analytics, associate, atmosphäre, base, class, clientele 
     Score: employee, pay, orgd, culture, lot, place, consult 
Topic 4 Top Words:
     Highest Prob: work, great, company, learn, environment, get, project 
     FREX: great, company, learn, environment, get, life, salary 
     Lift: abwechslungsreiche, appear, arbeiten, assignment, auch, aufgaben, beratungsunternehmen 
     Score: work, great, company, learn, environment, get, life 
Topic 5 Top Words:
     Highest Prob: opportunity, client, time, make, flexible, help, train 
     FREX: client, time, make, train, support, orgc, really 
     Lift: advantage, already, ample, analyst, approachable, bankrupt, certification 
     Score: opportunity, client, orgc, make, time, give, train 

Now, see what reviews that have higher probabilities of being associated with each topic.


findThoughts(topics5, texts = pro_prep$meta$pros, n = 1)

 Topic 1: 
     Quiet little stressful Very good family Very good environment Nice ORGC town Good economy Good sport program 
 Topic 2: 
     When I first interview ORGD STEP STRATEGY see Glassdoor review I wrongfully doubt value summer internship program However spend summer ORGD STEP STRATEGY I thank Sharona team enough As University Virginia student I come program experience Series B start management consult experience To surprise I place beautiful open office overlook mountain range Brentwood one affluent neighborhood Los Angeles California On one first day I take site credit retailer  million annual revenue work principal system analysis account merchandise floor plan customer relation system The office run quite like start company Each intern come elite school country bring unique perspective network program I privilege pitch idea principal firm high profile connection top CEOs Los Angeles Israel When idea determine valuable I give freedom work project throughout summer culminate event August meet C Suite management cut edge virtual reality augment reality gaming entertainment ad tech venture capital angel invest production cybersecurity technology firm Whether high growth start Silicon Beach just close M fund round billion dollar establish enterprise I see first hand Sharona command respect admiration power player business We even get meet Yossi Vardi Godfather Israel s hello tech industry now  start ecosystem world   billion invest first half  next Silicon Valley Yes Sharona high expectation work deliver expectation help intern develop professional I believe previous Glassdoor reviewer just may ability take constructive criticism use inspire good work habit Each day access kitchen fill snack drink In fact either go free lunch order free lunch office discuss recent be A activity business news meal Bottom line I appreciate Sharona develop professionally turn amateur management consultant analyst contact throughout business world experience multi million dollar client I even recommend program good know firm ability work intimately experience Pw C principal give experience access I just another intern bottom totem pole big firm 
 Topic 3: 
     Volle Integration die Projektteams interessante Projekte auf Topmanagement Level sympathische und smarte Kollegen gute Work Life Balance für consult 
 Topic 4: 
     Kulturschnelles Wachstum Work Life Balance Ausgleich zwischen Beruf und Freizeit Umgang mit dem Senior Management des Kunden 
 Topic 5: 
     Never see company share knowledge freely everyone even competitor Civil engineer learn electrical plan design electrical engineer learn civil structure anything possible wish They make complete engineer Non metro grade A city benefit flexi time ease commute

Next step, since topic models are probablistic, compute he probabilities of each document belonging to a topic.


head(topics5$theta, 15)

            [,1]       [,2]      [,3]      [,4]      [,5]
 [1,] 0.22539466 0.12575379 0.1370302 0.3161109 0.1957105
 [2,] 0.20783320 0.11708519 0.1620109 0.3327056 0.1803651
 [3,] 0.21384813 0.15223967 0.1689604 0.2796603 0.1852915
 [4,] 0.16964886 0.17469019 0.1524817 0.2305308 0.2726484
 [5,] 0.22389613 0.11284065 0.1546371 0.3417577 0.1668684
 [6,] 0.34155517 0.08164882 0.1242106 0.3525590 0.1000265
 [7,] 0.23563987 0.17120546 0.1657124 0.2761865 0.1512558
 [8,] 0.10443301 0.30396297 0.2232076 0.1628533 0.2055431
 [9,] 0.13258126 0.31363639 0.2109955 0.1873766 0.1554103
[10,] 0.09838425 0.17840294 0.1401331 0.2625235 0.3205563
[11,] 0.15684586 0.27350053 0.1615222 0.2436263 0.1645051
[12,] 0.26129997 0.14524674 0.1377146 0.3308216 0.1249171
[13,] 0.12951030 0.17127237 0.2118687 0.3298087 0.1575399
[14,] 0.11547104 0.30081425 0.2579578 0.1702397 0.1555172
[15,] 0.27911686 0.17869586 0.1375674 0.2808503 0.1237696

Since topic models, on the other hand, are great for predicting topic probabilities for unseen document.


newprotext = textProcessor(documents = glassDoor_mod$pros[holdoutRows],
                           metadata = glassDoor_mod[holdoutRows, ], 
                           stem = FALSE)

Building corpus... 
Converting to Lower Case... 
Removing punctuation... 
Removing stopwords... 
Removing numbers... 
Creating Output... 

newprocorp = alignCorpus(new = newprotext, old.vocab = topics5$vocab)

Removing 1 Documents with No Words 
Your new corpus now has 99 documents, 517 non-zero terms of 1534 total terms in the original set. 
187 terms from the new data did not match.
This means the new data contained 33.7% of the old terms
and the old data contained 73.4% of the unique terms in the new data. 
You have retained 1247 tokens of the 1444 tokens you started with (86.4%).

newprofitted = fitNewDocuments(model = topics5, documents = newprocorp$documents,
                               newData = newprocorp$meta, origData = pro_prep$meta)

...................................................................................................

Con Reviews

Preparation work


con_tm <- glassDoor_tm %>% as.data.frame()
colnames(con_tm) <- c("doc_id", "text", "rating")
conCorpus = Corpus(DataframeSource(con_tm))

conTDM <- TermDocumentMatrix(conCorpus, control = list(weighting = 
                                                         function(x)
                                                           weightTfIdf(x, normalize = FALSE)))

inspect(conTDM)

<<TermDocumentMatrix (terms: 3502, documents: 4)>>
Non-/sparse entries: 4221/9787
Sparsity           : 70%
Maximal term length: 25
Weighting          : term frequency - inverse document frequency (tf-idf)
Sample             :
           Docs
Terms             ORGA       ORGB ORGC       ORGD
  delivery    0.000000  10.000000    0   0.000000
  ensure      0.000000   0.000000   10   0.000000
  find        6.225562   2.905262    0   4.150375
  fmc         8.000000   0.000000    0   2.000000
  orga      132.000000   0.000000    0   0.000000
  orgb        0.000000 112.000000    0   0.000000
  orgc        0.000000   0.000000  144   0.000000
  orgcpany    0.000000   0.000000   10   0.000000
  orgd        0.000000   0.000000    0 180.000000
  something   4.000000   8.000000    0   0.000000

conConvert <- as.data.frame(as.matrix(conTDM))

conTibble = as_tibble(conConvert, .name_repair = "universal")

conTibble <- conTibble %>%
  mutate_all(., funs(. + .1))

rownames(conTibble) <- rownames(conConvert)

conNMF <- nmf(conTibble, 4, seed = 1001)

save(conNMF, file = "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW2/nmfOut2.Rata")

load("/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW2/nmfOut2.Rata")

wMatrix2 = as.data.frame(basis(conNMF))

head(wMatrix2[order(-wMatrix2$V1), ], 5)

                    V1           V2           V3           V4
orgb         15438.749 2.220446e-16 2.220446e-16 3.908536e-06
delivery      1390.977 1.413403e+01 1.098141e+01 1.611776e+01
enviornment   1115.531 1.441116e+01 1.119673e+01 1.643379e+01
satisfaction  1115.531 1.441116e+01 1.119673e+01 1.643379e+01
something     1115.114 1.394087e+01 4.943255e+02 1.589749e+01

head(wMatrix2[order(-wMatrix2$V2), ], 5)

                      V1         V2           V3           V4
orgc        3.668671e-05 22413.2119 2.220446e-16 2.220446e-16
ensure      1.279024e+01  1570.9150 1.122066e+01 1.646891e+01
orgcpany    1.279024e+01  1570.9150 1.122066e+01 1.646891e+01
orgcployees 1.298114e+01  1259.8359 1.138813e+01 1.671471e+01
attention   1.317204e+01   948.7569 1.155560e+01 1.696052e+01

head(wMatrix2[order(-wMatrix2$V3), ], 5)

                 V1           V2         V3           V4
orga   2.220446e-16 1.743540e-09 15955.3089 2.220446e-16
fmc    1.275900e+01 1.440669e+01   978.2945 3.714140e+02
find   4.129018e+02 1.402733e+01   763.6384 7.526571e+02
air    1.311997e+01 1.481427e+01   736.7512 1.689348e+01
ardent 1.311997e+01 1.481427e+01   736.7512 1.689348e+01

head(wMatrix2[order(-wMatrix2$V4), ], 5)

                   V1           V2        V3        V4
orgd     2.220453e-16 2.220816e-16  10.15161 31948.676
sharona  1.313385e+01 1.482995e+01  11.97329  1436.852
next     1.302973e+01 1.471238e+01 132.75550  1436.718
ceos     1.328657e+01 1.500239e+01  11.99447  1082.064
critical 1.328657e+01 1.500239e+01  11.99447  1082.064

set.seed(1001)

holdoutRows <- sample(1:nrow(glassDoor_mod), 100, replace = FALSE)

con_text <- textProcessor(documents = glassDoor_mod$cons[-c(holdoutRows)],
                          metadata = glassDoor_mod[-c(holdoutRows), ],
                          stem = FALSE)

Building corpus... 
Converting to Lower Case... 
Removing punctuation... 
Removing stopwords... 
Removing numbers... 
Creating Output... 

con_prep = prepDocuments(documents=con_text$documents,
                         vocab = con_text$vocab,
                         meta = con_text$meta)

Removing 2158 of 4033 terms (2158 of 21548 tokens) due to frequency 
Removing 3 Documents with No Words 
Your corpus now has 1573 documents, 1875 terms and 19390 tokens.

In this case, as k passes 5, the residuals seemingly take a sharp dive as k increases. Thus, choose 5 as the k - yet again.


ktest2 = searchK(documents = con_prep$documents,
                vocab = con_prep$vocab, 
                K = c(3, 4, 5, 10, 20), verbose = FALSE)

plot(ktest2)

Wih the 5 topics in hand, compute the actual model. The expected topic proportion plot is as follows. Here, topic 5 has the highest expected topic proportion, which contains words like company, good and much. Here “good” appears to be a bit dubious. The topic with the second highest expected proportion is topic 1.


topics5_2 <- stm(documents = con_prep$documents,
               vocab = con_prep$vocab, seed = 1001,
               K = 5, verbose = FALSE)

plot(topics5_2)

Now, see what emerges from the topics in details. FREX words are probably what we should focus on here, since they occur frequently within the topic and are exclusive to that topic. Here, in topic 5: FEEX words are company, good, lack, little, need, low, business - with the presence of words like “lack” and “need”, this topic being expected of high proportion is more rational now. Also, the highest probability words are worthy of attention because there are the words with the highest probability of occuring within that topic. Again, under topic 4, highest probability words are: company, good, much, lack, little, need, low. Here, the addition of “low” just makes more sense of topic 5’s pertaining to con reviews.


labelTopics(topics5_2)

Topic 1 Top Words:
     Highest Prob: work, project, life, lot, long, balance, hard 
     FREX: project, life, lot, balance, hard, sometimes, compensation 
     Lift: acquisition, adapt, admin, affect, agree, allocation, als 
     Score: work, balance, life, und, die, project, sometimes 
Topic 2 Top Words:
     Highest Prob: can, people, hour, staff, firm, consult, year 
     FREX: people, staff, firm, small, new, orga, contract 
     Lift: acquire, active, always, apart, behind, bottom, bureaucratic 
     Score: can, staff, firm, people, year, support, none 
Topic 3 Top Words:
     Highest Prob: management, employee, pay, don, make, place, take 
     FREX: pay, take, even, leave, lead, week, never 
     Lift: act, advertise, analysis, begin, bond, bus, butt 
     Score: pay, employee, management, even, leave, take, don 
Topic 4 Top Words:
     Highest Prob: get, time, client, office, high, like, manager 
     FREX: time, client, office, high, like, opportunity, really 
     Lift: anywhere, avoid, base, box, cause, chase, club 
     Score: get, time, client, know, like, high, travel 
Topic 5 Top Words:
     Highest Prob: company, good, much, lack, little, need, low 
     FREX: company, good, lack, little, need, low, business 
     Lift: alabama, capacity, child, clean, competitor, conservative, coverage 
     Score: company, lack, good, team, con, little, business 

Now, see what reviews that have higher probabilities of being associated with each topic.


findThoughts(topics5_2, texts = con_prep$meta$cons, n = 1)

 Topic 1: 
     Schwerfällige und langsame Prozesse Projekte ohne wirkliche Auswirkungen Initiativen und Programme die nach Weiterleitung auf Geschäftsebenen verwässern Erfolglose Umstrukturierungsversuche 
 Topic 2: 
     Salary always play catch Mc Kinsey s Bain s BCG s result always seem bite behind The performance review process seem bite vague effort always link reward bonus promotion promotion affect good First level bonus almost unheard drop precipitously mark Work can dynamic interest gaggle mundane project good The project one staff luck draw room system advocate interest project exist 
 Topic 3: 
     The list endless  appoint team manager autocratic approach Lunch break You force take half n hour lunch break team wish break timing also give reason literally say yes everything wish sustain organisation Else humiliate ask give resignation Also put false allegation resignation reason Unfair treatment  due diligence give employee s feedback Half day salary Office timing start am can report   late   half day salary deduct Which also act de motivation s good come half day    instead get half day salary deduct minute  You allow take early leave leave actually require even ready submit medical document  s another funny thing though have keep tea coffee vend machine machine operation just twice day have specifically keep fix hour mean can t even drink tea coffee per convenience do per wish No proper train give even thefresher candidate  good place growth high attrition rate also mention reason 
 Topic 4: 
     frustrate people receive lead generator actually make contact account make introductory call next day send sheet time owner contact expect state time date state form In case inaccurate  time So egg come chicken opportunity sale decrease near zero Also case I zero train manager assistant area manager something I promise occur I take position 
 Topic 5: 
     liar lack communication dishon ORGDst l ORGDad ORGDrship th ORGD gov ORGDrnm ORGDnt sit ORGD can t manag ORGD th ORGD ORGDmploy ORGDORGDs Mak ORGD ORGDxcus ORGDs l ORGDtting som ORGDon ORGD go win t list ORGDn th ORGD ORGDmploy ORGDORGD Would r ORGDcomm ORGDnd anyon ORGD job work company

Next step, since topic models are probablistic, compute he probabilities of each document belonging to a topic.


head(topics5_2$theta, 15)

            [,1]      [,2]       [,3]       [,4]       [,5]
 [1,] 0.22869541 0.1729554 0.13122362 0.13299962 0.33412595
 [2,] 0.39414666 0.2361382 0.08578816 0.11323388 0.17069307
 [3,] 0.11002442 0.1577489 0.28969997 0.26877928 0.17374742
 [4,] 0.10308559 0.1519916 0.28320292 0.30294445 0.15877542
 [5,] 0.32811454 0.2071410 0.10056000 0.13780875 0.22637573
 [6,] 0.37267083 0.1215989 0.25455147 0.15322175 0.09795707
 [7,] 0.27169774 0.2030026 0.13030132 0.17937442 0.21562398
 [8,] 0.13755266 0.1444468 0.20128626 0.41771536 0.09899892
 [9,] 0.11144815 0.1944214 0.17141528 0.15262701 0.37008812
[10,] 0.13810604 0.2436554 0.16594474 0.19072494 0.26156883
[11,] 0.13909243 0.2399167 0.23121935 0.18332491 0.20644664
[12,] 0.08990613 0.1491018 0.14312706 0.11859870 0.49926632
[13,] 0.08990613 0.1491018 0.14312706 0.11859870 0.49926632
[14,] 0.67112038 0.1103481 0.05497527 0.07829753 0.08525876
[15,] 0.38029277 0.1744278 0.14968240 0.16452531 0.13107167

As for the advice reviews…

Preparation work


advice_tm <- glassDoor_tm %>% as.data.frame()
colnames(advice_tm) <- c("doc_id", "text", "rating")
adviceCorpus = Corpus(DataframeSource(advice_tm))

adviceTDM <- TermDocumentMatrix(adviceCorpus, control = list(weighting = 
                                                         function(x)
                                                           weightTfIdf(x, normalize = FALSE)))

inspect(adviceTDM)

<<TermDocumentMatrix (terms: 3502, documents: 4)>>
Non-/sparse entries: 4221/9787
Sparsity           : 70%
Maximal term length: 25
Weighting          : term frequency - inverse document frequency (tf-idf)
Sample             :
           Docs
Terms             ORGA       ORGB ORGC       ORGD
  delivery    0.000000  10.000000    0   0.000000
  ensure      0.000000   0.000000   10   0.000000
  find        6.225562   2.905262    0   4.150375
  fmc         8.000000   0.000000    0   2.000000
  orga      132.000000   0.000000    0   0.000000
  orgb        0.000000 112.000000    0   0.000000
  orgc        0.000000   0.000000  144   0.000000
  orgcpany    0.000000   0.000000   10   0.000000
  orgd        0.000000   0.000000    0 180.000000
  something   4.000000   8.000000    0   0.000000

adviceConvert <- as.data.frame(as.matrix(adviceTDM))

adviceTibble = as_tibble(adviceConvert, .name_repair = "universal")

adviceTibble <- adviceTibble %>%
  mutate_all(., funs(. + .1))

rownames(adviceTibble) <- rownames(adviceConvert)

adviceNMF <- nmf(adviceTibble, 4, seed = 1001)

save(adviceNMF, file = "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW2/nmfOut3.Rata")

load("/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW2/nmfOut3.Rata")

wMatrix3 = as.data.frame(basis(adviceNMF))

head(wMatrix3[order(-wMatrix3$V1), ], 5)

                    V1           V2           V3           V4
orgb         15438.749 2.220446e-16 2.220446e-16 3.908536e-06
delivery      1390.977 1.413403e+01 1.098141e+01 1.611776e+01
enviornment   1115.531 1.441116e+01 1.119673e+01 1.643379e+01
satisfaction  1115.531 1.441116e+01 1.119673e+01 1.643379e+01
something     1115.114 1.394087e+01 4.943255e+02 1.589749e+01

head(wMatrix3[order(-wMatrix3$V2), ], 5)

                      V1         V2           V3           V4
orgc        3.668671e-05 22413.2119 2.220446e-16 2.220446e-16
ensure      1.279024e+01  1570.9150 1.122066e+01 1.646891e+01
orgcpany    1.279024e+01  1570.9150 1.122066e+01 1.646891e+01
orgcployees 1.298114e+01  1259.8359 1.138813e+01 1.671471e+01
attention   1.317204e+01   948.7569 1.155560e+01 1.696052e+01

head(wMatrix3[order(-wMatrix3$V3), ], 5)

                 V1           V2         V3           V4
orga   2.220446e-16 1.743540e-09 15955.3089 2.220446e-16
fmc    1.275900e+01 1.440669e+01   978.2945 3.714140e+02
find   4.129018e+02 1.402733e+01   763.6384 7.526571e+02
air    1.311997e+01 1.481427e+01   736.7512 1.689348e+01
ardent 1.311997e+01 1.481427e+01   736.7512 1.689348e+01

head(wMatrix3[order(-wMatrix3$V4), ], 5)

                   V1           V2        V3        V4
orgd     2.220453e-16 2.220816e-16  10.15161 31948.676
sharona  1.313385e+01 1.482995e+01  11.97329  1436.852
next     1.302973e+01 1.471238e+01 132.75550  1436.718
ceos     1.328657e+01 1.500239e+01  11.99447  1082.064
critical 1.328657e+01 1.500239e+01  11.99447  1082.064

set.seed(1001)

holdoutRows <- sample(1:nrow(glassDoor_mod), 100, replace = FALSE)

advice_text <- textProcessor(documents = glassDoor_mod$advice[-c(holdoutRows)],
                          metadata = glassDoor_mod[-c(holdoutRows), ],
                          stem = FALSE)

Building corpus... 
Converting to Lower Case... 
Removing punctuation... 
Removing stopwords... 
Removing numbers... 
Creating Output... 

advice_prep = prepDocuments(documents=advice_text$documents,
                         vocab = advice_text$vocab,
                         meta = advice_text$meta)

Removing 1126 of 2036 terms (1126 of 8407 tokens) due to frequency 
Removing 15 Documents with No Words 
Your corpus now has 934 documents, 910 terms and 7281 tokens.

In this case, as k passes 5, the residuals seemingly take a sharp dive as k increases - that drop is pretty spectacular. Thus, choose 5 as the k - again and again.


ktest3 = searchK(documents = advice_prep$documents,
                vocab = advice_prep$vocab, 
                K = c(3, 4, 5, 10, 20), verbose = FALSE)

plot(ktest3)

For comments in the advice column, topic 3 has the highest expected topic proportion, and topic 5 follows. It is interesting to see words like “train” pop up since implication as “training opportunities” is pertinent to advice from many perspectives.


topics5_3 <- stm(documents = advice_prep$documents,
               vocab = advice_prep$vocab, seed = 1001,
               K = 5, verbose = FALSE)

plot(topics5_3)

Now, see what emerges from the topics in details. FREX words are probably what we should focus on here, since they occur frequently within the topic and are exclusive to that topic. Here, in topic 3: FEEX words are good, work, company, need, get, take, manager. “Need” really tells something about advice. Also, the highest probability words are worthy of attention because there are the words with the highest probability of occuring within that topic. Again, under topic 3, highest probability words are: employee, good, work, company, need, get, take. Here, high probability words kind of echo with the FREX words.


labelTopics(topics5_3)

Topic 1 Top Words:
     Highest Prob: make, continue, client, team, respect, want, increase 
     FREX: continue, client, team, increase, career, stay, resource 
     Lift: instead, may, phone, practice, relationship, site, sure 
     Score: continue, team, make, client, practice, sure, increase 
Topic 2 Top Words:
     Highest Prob: management, don, time, pay, way, job, try 
     FREX: management, don, time, pay, way, job, level 
     Lift: job, advise, almost, also, another, answer, anyone 
     Score: management, don, time, pay, way, job, long 
Topic 3 Top Words:
     Highest Prob: employee, good, work, company, need, get, take 
     FREX: good, work, company, need, get, take, manager 
     Lift: acquire, adhere, amount, analyst, approach, assume, bright 
     Score: good, employee, work, company, need, get, just 
Topic 4 Top Words:
     Highest Prob: keep, staff, hire, great, treat, focus, much 
     FREX: keep, hire, great, treat, focus, much, place 
     Lift: action, base, consistent, cut, direction, everyone, focus 
     Score: keep, staff, hire, great, much, treat, place 
Topic 5 Top Words:
     Highest Prob: people, business, train, new, can, give, like 
     FREX: people, business, new, give, like, please, project 
     Lift: add, area, consultant, deliver, develop, experience, honest 
     Score: people, new, project, learn, like, give, business 

Now, see what reviews that have higher probabilities of being associated with each topic.


findThoughts(topics5_3, texts = advice_prep$meta$advice, n = 1)

 Topic 1: 
     Communicate little The Street want hear instead talk Front Line want hear question 
 Topic 2: 
     If I John Burgess owner I start fix tarnish lead database isn t hard If  bad lead mention fix raise sale  AND I WOULD FIGURE OUT A WAY TO PAY employee FOR THEIR OWN PERFORMANCE AND NOT FOR THE PERFORMANCE OF other 
 Topic 3: 
     Look employee listen one job can probably help figure work doesn t Retain employee massive turnover good employee client show unstable company employee afraid job review claim Get organize make plan Randomly change process policy without clear plan employee create chaos even harsh environment already Communicate company s plan employee whether sell company improve plan anything work 
 Topic 4: 
     I recommend much keep awesome Christmas party keep keep fun 
 Topic 5: 
     Giv ORGD th ORGD mang ORGDrs train b ORGDing b ORGDtt ORGDr communicator train List ORGDn th ORGD ORGDmbarrassing ORGDmploy ORGDORGDs Th ORGDy ar ORGD human have ORGDs ORGDrv ORGDd b ORGD h ORGDard

Next step, since topic models are probablistic, compute he probabilities of each document belonging to a topic.


head(topics5_3$theta, 15)

            [,1]       [,2]      [,3]       [,4]      [,5]
 [1,] 0.14627924 0.20513466 0.2427464 0.15521592 0.2506238
 [2,] 0.09978360 0.23643414 0.2586945 0.22856616 0.1765216
 [3,] 0.08243238 0.11316495 0.3046714 0.13412719 0.3656040
 [4,] 0.09270360 0.13040056 0.1333031 0.06433677 0.5792560
 [5,] 0.17558943 0.17532593 0.2312919 0.18935416 0.2284386
 [6,] 0.07390013 0.23863909 0.3091568 0.23192541 0.1463786
 [7,] 0.16758335 0.11784704 0.2242318 0.20023769 0.2901001
 [8,] 0.15818393 0.15875001 0.2419165 0.16449827 0.2766513
 [9,] 0.09326340 0.23056989 0.2839878 0.22495201 0.1672269
[10,] 0.07182122 0.50233641 0.2006376 0.11244249 0.1127623
[11,] 0.09503489 0.27850053 0.2773401 0.20889718 0.1402273
[12,] 0.11441160 0.25884813 0.2562035 0.17853134 0.1920055
[13,] 0.19742083 0.16236406 0.1610574 0.16301120 0.3161465
[14,] 0.08381568 0.06424385 0.1036296 0.08203162 0.6662793
[15,] 0.11441160 0.25884813 0.2562035 0.17853134 0.1920055

XIII. How does rating contribute to the mapping of topics?

Pro Reviews

Preparation work


glassDoor3 <- glassDoor_mod %>%
  drop_na(rating)

predictortext <- textProcessor(documents = glassDoor3$pros,
                               metadata = glassDoor3,
                               stem = FALSE)

Building corpus... 
Converting to Lower Case... 
Removing punctuation... 
Removing stopwords... 
Removing numbers... 
Creating Output... 

proprep2 <- prepDocuments(documents = predictortext$documents,
                          vocab = predictortext$vocab,
                          meta = predictortext$meta)

Removing 1863 of 3454 terms (1863 of 21932 tokens) due to frequency 
Removing 1 Documents with No Words 
Your corpus now has 1676 documents, 1591 terms and 20069 tokens.

The Prediction Models

For the topics of pro reviews, we could see that rating has significant impact on topics 1, 5, and on topic 4.


topicpredictor = stm(documents = proprep2$documents,
                     vocab = proprep2$vocab, prevalence = ~rating,
                     data = proprep2$meta, K = 5, verbose = FALSE)

proratingeffect = estimateEffect(1:5 ~ rating, stmobj = topicpredictor,
                                 metadata = proprep2$meta)

summary(proratingeffect, topics = c(1:5))

Call:
estimateEffect(formula = 1:5 ~ rating, stmobj = topicpredictor, 
    metadata = proprep2$meta)


Topic 1:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.118978   0.004797   24.80   <2e-16 ***
rating      0.016395   0.001569   10.45   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Topic 2:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.212271   0.006302  33.683   <2e-16 ***
rating      0.003071   0.002191   1.402    0.161    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Topic 3:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.242782   0.006881  35.282   <2e-16 ***
rating      -0.001880   0.002338  -0.804    0.421    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Topic 4:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.199226   0.006519  30.563   <2e-16 ***
rating      -0.004040   0.002254  -1.792   0.0733 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Topic 5:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.226795   0.008510  26.649  < 2e-16 ***
rating      -0.013545   0.002864  -4.729 2.44e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Plots.

Here, in each plot for each topic, the x-axes denote the ratings while the y-axes denote the corresponding expected topic proportion. As the rating decreases, the contents of pro reviews are mostly likely to fall under topic 3 (notice that there are words like “small”, which might imply “small company” or “small team”, which might work adversely when it comes to rating), then topic 5. As the rating increases, the contents of pro reviews tend to fall under topic 2.


plot.estimateEffect(proratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 1, labeltype = "frex")


plot.estimateEffect(proratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 2, labeltype = "frex")


plot.estimateEffect(proratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 3, labeltype = "frex")


plot.estimateEffect(proratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 4, labeltype = "frex")


plot.estimateEffect(proratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 5, labeltype = "frex")

Con Reviews

Preparation work


predictortext2 <- textProcessor(documents = glassDoor3$cons,
                               metadata = glassDoor3,
                               stem = FALSE)

Building corpus... 
Converting to Lower Case... 
Removing punctuation... 
Removing stopwords... 
Removing numbers... 
Creating Output... 

conprep2 <- prepDocuments(documents = predictortext2$documents,
                          vocab = predictortext2$vocab,
                          meta = predictortext2$meta)

Removing 2212 of 4146 terms (2212 of 22558 tokens) due to frequency 
Removing 2 Documents with No Words 
Your corpus now has 1674 documents, 1934 terms and 20346 tokens.

The Prediction Models

For the topics of pro reviews, we could see that rating has significant impact on all topics from 1 to 5.


topicpredictor2 = stm(documents = conprep2$documents,
                     vocab = conprep2$vocab, prevalence = ~rating,
                     data = conprep2$meta, K = 5, verbose = FALSE)

conratingeffect = estimateEffect(1:5 ~ rating, stmobj = topicpredictor2,
                                 metadata = conprep2$meta)

summary(conratingeffect, topics = c(1:5))

Call:
estimateEffect(formula = 1:5 ~ rating, stmobj = topicpredictor2, 
    metadata = conprep2$meta)


Topic 1:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.040494   0.009703  -4.173 3.16e-05 ***
rating       0.074916   0.003421  21.901  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Topic 2:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.23574    0.01020  23.112  < 2e-16 ***
rating      -0.01452    0.00335  -4.334 1.55e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Topic 3:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.412832   0.013944   29.61   <2e-16 ***
rating      -0.062860   0.004449  -14.13   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Topic 4:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.266247   0.011423  23.308  < 2e-16 ***
rating      -0.022380   0.003571  -6.267 4.66e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Topic 5:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.125924   0.010924  11.527  < 2e-16 ***
rating      0.024727   0.003734   6.622 4.76e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Plots.

Here, in each plot for each topic, the x-axes denote the ratings while the y-axes denote the corresponding expected topic proportion. As the rating increases, the con reviews are most likely to fall under topic 1; as the rating decreases, the con reviews are most likely to fall under topic 3.


plot.estimateEffect(conratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 1, labeltype = "frex")


plot.estimateEffect(conratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 2, labeltype = "frex")


plot.estimateEffect(conratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 3, labeltype = "frex")


plot.estimateEffect(conratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 4, labeltype = "frex")


plot.estimateEffect(conratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 5, labeltype = "frex")

Again, for the advice…

Preparation work


predictortext3 <- textProcessor(documents = glassDoor3$advice,
                               metadata = glassDoor3,
                               stem = FALSE)

Building corpus... 
Converting to Lower Case... 
Removing punctuation... 
Removing stopwords... 
Removing numbers... 
Creating Output... 

adviceprep2 <- prepDocuments(documents = predictortext3$documents,
                          vocab = predictortext3$vocab,
                          meta = predictortext3$meta)

Removing 1152 of 2101 terms (1152 of 8847 tokens) due to frequency 
Removing 15 Documents with No Words 
Your corpus now has 989 documents, 949 terms and 7695 tokens.

The Prediction Models

For the topics of advice, we could see that rating has significant impact on topics 2, 3, 4, and 5. Rating also has impact on topic 1, but probably not that significant compared with the rest.


topicpredictor3 = stm(documents = adviceprep2$documents,
                     vocab = adviceprep2$vocab, prevalence = ~rating,
                     data = adviceprep2$meta, K = 5, verbose = FALSE)

adviceratingeffect = estimateEffect(1:5 ~ rating, stmobj = topicpredictor3,
                                 metadata = adviceprep2$meta)

summary(adviceratingeffect, topics = c(1:5))

Call:
estimateEffect(formula = 1:5 ~ rating, stmobj = topicpredictor3, 
    metadata = adviceprep2$meta)


Topic 1:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.158029   0.004212  37.516   <2e-16 ***
rating      -0.003035   0.001303  -2.329   0.0201 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Topic 2:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.222876   0.005778  38.574  < 2e-16 ***
rating      -0.007847   0.001724  -4.553 5.96e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Topic 3:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.166785   0.005989  27.847   <2e-16 ***
rating      0.006539   0.002130   3.069   0.0022 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Topic 4:

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.232235   0.007277  31.916  < 2e-16 ***
rating      -0.009205   0.002254  -4.084 4.78e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Topic 5:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.220150   0.005818  37.838  < 2e-16 ***
rating      0.013508   0.001820   7.422 2.48e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Plots.

Here, in each plot for each topic, the x-axes denote the ratings while the y-axes denote the corresponding expected topic proportion. As the rating increases, the advice contents is likely to fall under topic 5. As the rating gets lower, the advice contents is likely to fall under topic 4 and then topic 2, with a small difference in the expected topic proportion.


plot.estimateEffect(adviceratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 1, labeltype = "frex")


plot.estimateEffect(adviceratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 2, labeltype = "frex")


plot.estimateEffect(adviceratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 3, labeltype = "frex")


plot.estimateEffect(adviceratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 4, labeltype = "frex")


plot.estimateEffect(adviceratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 5, labeltype = "frex")

XIV. Conclusion

To answer the questions raised earlier in the introduction.
(1) The words with highest term frequencies appear similar in pro reviews, con reviews, and advice for the four organizations. However, the unigrams of higher tf-idf’s differ in pros, cons, and advice for every organization. Bigrams with higher presence differ for pro reviews, con reviews, and advice.
(2) Overall rating is the factor of interest to investigate on for this project.
(3) Sentiment reflected in pros, cons, and advice all impact the overall rating.
(4) Topic modelings for pro reviews, con reviews, and advice appear rational to some extent. For details, please refer back to the Topic Modeling section.
(5) Rating does have certain impact in the mapping of topics for pros, cons, and advice.

Regrettably, the word clouds seem not to be working that well (reasons to be discovered later). Also, I tried to clean up the German shown in the text, but that seems not to be that thorough as some German pops up in topic modeling.

XV. Final Thoughts & Limitations & Possible Improvement

Regrettably, the word clouds seem not to be working that well (reasons to be discovered later). Also, I tried to clean up the German shown in the text, but that seems not to be that thorough as some German pops up in topic modeling.